Overview

Premature optimization is the root of all evil. -- Donald Knuth

A good framework to keep in mind is:

  1. Make it work -- first make sure your code works properly, and make sure you save/version-control the correct copy. Fast but buggy code doesn't help anyone.

  2. Make it clean -- make your code easy to read and understand. This'll help you see where the potential pinch points are, and it'll also help others read your code. If your code is an unmaintainable mess, people (including future-you) aren't going to want to touch it, even if it is fast.

  3. Make it fast -- now, and only now, can you start speeding things up.

I'll assume you've already done #1 and #2. You're now interested in #3.

A few strategies we won't talk about

  1. Run things in parallel. My desktop has 12 cores. If I only run my analysis on 1 core while leaving the other 11 idle, I'll be sitting there all day like a chump. Using multiprocessing.Pool.map is an easy way to get started (there's a minimal sketch just after this list). For more complex parallelism, check out AMS250 - High Performance Computing.

  2. Switch to a compiled language. This is a pain, especially if you're using a lot of special functions from python packages. But if you're using >100 cores, you have relatively few dependencies, and your thesis depends on large numerical calculations, then this is your best bet.

  3. Cython-ize your code. Cython theoretically gives you the benefits of compiled C code while keeping easy access from Python. Jake VanderPlas has a recent demo. In practice it's a bit of a pain.
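Here's a minimal sketch of that Pool.map pattern, in case you've never seen it. The function and filenames are hypothetical stand-ins for your own analysis, and this is meant to live in a script (hence the __main__ guard):

import multiprocessing

def process_one_file(filename):
    # hypothetical stand-in for your per-file analysis
    return len(filename)

if __name__ == "__main__":
    filenames = ["run_{:03d}.dat".format(i) for i in range(100)]
    with multiprocessing.Pool(processes=4) as pool:
        # farm the per-file work out to 4 worker processes
        results = pool.map(process_one_file, filenames)
    print(len(results))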

Techniques we will talk about

  1. Vectorizing/broadcasting for loops
  2. Profiling ("where do I need to optimize?")
    1. manual timing (simple, yet coarse)
    2. [`line_profiler`](https://github.com/rkern/line_profiler)
    3. [`cProfile`](https://docs.python.org/3/library/profile.html#module-cProfile)
  3. numba

1) Vectorize for loops

Any time you write for i in range(<size>):, it should raise a red flag in your mind. Take a moment to think: does this functionality already exist within numpy?

This is generally possible if:

  1. You're doing the same thing to each element in an array
  2. You're doing algebraic manipulations, or "reduction"-type operations (like max, sum, mean, etc.). Algebra happens automagically; if you're doing a common reduction, it's worth searching to see if numpy already has it.
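Here's a quick taste of what broadcasting looks like (a toy sketch of my own, not one of the original examples): subtracting each row's mean from a 2D array, with no explicit loop:

import numpy as np

data = np.random.random((1000, 50))
row_means = data.mean(axis=1)                # shape (1000,)
centered = data - row_means[:, np.newaxis]   # (1000, 50) minus (1000, 1): broadcast across columns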

In [1]:
import numpy as np

Simple example


In [2]:
image_size = 1000
image = np.random.random((image_size, image_size))

In [3]:
%%timeit 
sum_sq = 0
for i in range(image_size):
    for j in range(image_size):
        sum_sq += image[i, j]**2
        
rms = (sum_sq / (image_size**2))**.5


468 ms ± 7.98 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [4]:
%%timeit
rms = (image**2).mean()**.5


1.38 ms ± 12.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Wow! A speed-up of well over a factor of 100! (And it's a bit easier to read.)

Note: This works great for algebra, but it tends not to work for inherently sequential things, like numerical integration from different starting points, where each step depends on the previous one.

numpy.vectorize exists, but it's primarily there to make code pretty, not to make it fast: it makes things look vectorized, but under the hood it's still just a Python for loop.
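For instance (a toy sketch, not from the original notebook), wrapping a scalar-only function with numpy.vectorize gives you array syntax, but each element still goes through a Python-level call:

import numpy as np

def clip_to_unit(x):
    # scalar-only function
    return min(max(x, 0.0), 1.0)

clip_vec = np.vectorize(clip_to_unit)   # looks vectorized...
values = np.linspace(-1, 2, 100000)
result = clip_vec(values)               # ...but it's still a Python loop underneath
# the genuinely vectorized np.clip(values, 0, 1) is much faster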

More complex example

We want the coordinates of a 2D grid (1000 x 1000). At each grid point, I want the x and y coordinates. These will be stored as two 2D arrays (e.g. X will be 1000x1000, and will contain the x coordinate at each location).


In [5]:
size=1000

xs = range(size)
ys = range(size)

In [6]:
%%timeit

X = np.empty((size, size))
Y = np.empty((size, size))

for i, x in enumerate(xs):
    for j, y in enumerate(ys):
        X[i, j] = x
        Y[i, j] = y


321 ms ± 1.28 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [7]:
%%timeit
Y, X = np.meshgrid(xs, ys)


2.86 ms ± 3.22 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Pretty good! Still a factor of 100 speed-up!

If you look at the numpy documentation for meshgrid, you'll see it suggests "index tricks" like np.mgrid and np.ogrid. These can also be powerful, but they quickly become much less readable, so we're not going to talk about them in any depth.
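Just so you've seen the flavor, here's a sketch of my own (not from the original notebook) that produces the same X and Y as the loop version above:

import numpy as np

size = 1000
# np.mgrid is one of those index tricks: X[i, j] == i and Y[i, j] == j
X, Y = np.mgrid[0:size, 0:size]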

2) Profiling

Profiling won't speed up your code; it'll just help you isolate what's slowing it down.

2a) timing: the "poor man's profiler"


In [8]:
%%time 

size=20000

xs = range(size)
ys = range(size)
Y, X = np.meshgrid(xs, ys)

np.savetxt("tmp.txt", X)


CPU times: user 2min 39s, sys: 10.3 s, total: 2min 50s
Wall time: 2min 52s

In [9]:
!ls -lh tmp.txt


-rw-r--r--  1 egentry  staff   9.3G Jan 25 16:46 tmp.txt

In [10]:
%%time 

size=20000

xs = range(size)
ys = range(size)
Y, X = np.meshgrid(xs, ys)

# np.savetxt("tmp.txt", X)


CPU times: user 1.83 s, sys: 3.13 s, total: 4.96 s
Wall time: 5.22 s

You can also time separate parts of code:


In [11]:
import datetime

In [12]:
size=2000

start_time_computation = datetime.datetime.now()
xs = range(size)
ys = range(size)
Y, X = np.meshgrid(xs, ys)
end_time_computation = datetime.datetime.now()

start_time_io = datetime.datetime.now()
np.savetxt("tmp.txt", X)
end_time_io = datetime.datetime.now()

In [13]:
run_time_computation = end_time_computation - start_time_computation
run_time_io = end_time_io - start_time_io

print("Run time (computation): ", run_time_computation)
print("Run time (io):          ", run_time_io)


Run time (computation):  0:00:00.424671
Run time (io):           0:00:01.607549
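If you find yourself sprinkling timers like that everywhere, a small context manager cuts down the boilerplate. This is a sketch of my own (using time.perf_counter), not something from the original notebook:

import time
from contextlib import contextmanager

import numpy as np

@contextmanager
def timer(label):
    # print how long the wrapped block took
    start = time.perf_counter()
    yield
    print("{}: {:.3f} s".format(label, time.perf_counter() - start))

size = 2000
with timer("computation"):
    Y, X = np.meshgrid(range(size), range(size))

with timer("io"):
    np.savetxt("tmp.txt", X)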

This works fine for quick checks, but it assumes:

  1. You have a pretty good idea of where the slowdown is. It's a pain to have to re-run this for every line of code.
    • Even with multiple timers and doing a binary search through your code, it still might take a while to find exactly what lines are slowing you down.
  2. The "comment things out" approach clearly doesn't work if commenting out a line causes your code to crash. A more subtle issue: even if your code still works, the rest of the code might have drastically different runtime. (For example, turning cooling off in my hydro sims leads to a totally different problem.)

So this is pretty limited for more complex bits of code.

2b) line_profiler

This makes it easy to see where the slowdown is happening. And it profiles the entire script/function, so you don't have to keep re-running your code with timers at slightly different locations.


In [14]:
def simple_sum_sq(size):
    image = np.random.random((size, size))
    
    sum_sq = 0
    for i in range(size):
        for j in range(size):
            sum_sq += image[i, j]**2

In [15]:
%load_ext line_profiler

In [16]:
%lprun -f simple_sum_sq simple_sum_sq(200)

Example output:

Timer unit: 1e-06 s

Total time: 0.060595 s
File: <ipython-input-20-d9bf5ccb538f>
Function: simple_sum_sq at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           def simple_sum_sq(size):
     2         1        821.0    821.0      1.4      image = np.random.random((size, size))
     3                                               
     4         1          1.0      1.0      0.0      sum_sq = 0
     5       201         95.0      0.5      0.2      for i in range(size):
     6     40200      16937.0      0.4     28.0          for j in range(size):
     7     40000      42741.0      1.1     70.5              sum_sq += image[i, j]**2

That's a pretty nice overview of how long each line takes and how often it gets run. While the image = ... line takes the longest per hit, the sum_sq += ... line gets run way more often, and ends up accounting for ~70% of the runtime.

I show a more complex profiling example near the bottom of this notebook ("Real World Example")
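(If you're working in a plain script rather than a notebook, line_profiler ships with a command-line wrapper called kernprof: you mark the function with a bare @profile decorator and run the script through kernprof. A rough sketch, with a made-up filename:)

# my_script.py  -- hypothetical filename; run with:  kernprof -l -v my_script.py
import numpy as np

@profile   # injected by kernprof when run with -l; no import needed
def simple_sum_sq(size):
    image = np.random.random((size, size))
    sum_sq = 0
    for i in range(size):
        for j in range(size):
            sum_sq += image[i, j]**2
    return sum_sq

simple_sum_sq(200)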

2c) cProfile

This is great if you want to see which functions are slow, but don't care where they are located within your code.

This is good for a very complex project with lots of nested functions. This way you know that calculate_pressure is taking up a bunch of time, even though it's called from 20 locations in your code (and at each location it's not noticeably slow).

(Note: I haven't used cProfile much, so I don't know too much about it...)


In [17]:
import cProfile

In [18]:
def create_mesh_grid(size):
    xs = range(size)
    ys = range(size)
    Y, X = np.meshgrid(xs, ys)

In [19]:
cProfile.run("create_mesh_grid(20000)")


         48 function calls in 3.601 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    3.303    3.303 <ipython-input-18-c2834fc493bc>:1(create_mesh_grid)
        1    0.298    0.298    3.601    3.601 <string>:1(<module>)
        2    0.000    0.000    0.000    0.000 function_base.py:213(iterable)
        1    0.000    0.000    3.303    3.303 function_base.py:4554(meshgrid)
        1    0.000    0.000    0.005    0.005 function_base.py:4671(<listcomp>)
        1    0.000    0.000    3.298    3.298 function_base.py:4684(<listcomp>)
        2    0.000    0.000    0.005    0.003 numeric.py:534(asanyarray)
        2    0.000    0.000    0.000    0.000 stride_tricks.py:115(_broadcast_to)
        6    0.000    0.000    0.000    0.000 stride_tricks.py:120(<genexpr>)
        1    0.000    0.000    0.000    0.000 stride_tricks.py:176(_broadcast_shape)
        1    0.000    0.000    0.000    0.000 stride_tricks.py:195(broadcast_arrays)
        1    0.000    0.000    0.000    0.000 stride_tricks.py:247(<listcomp>)
        2    0.000    0.000    0.000    0.000 stride_tricks.py:25(_maybe_view_as_subclass)
        2    0.000    0.000    0.000    0.000 stride_tricks.py:251(<genexpr>)
        1    0.000    0.000    0.000    0.000 stride_tricks.py:257(<listcomp>)
        1    0.000    0.000    0.000    0.000 {built-in method builtins.all}
        2    0.000    0.000    0.000    0.000 {built-in method builtins.any}
        1    0.000    0.000    3.601    3.601 {built-in method builtins.exec}
        2    0.000    0.000    0.000    0.000 {built-in method builtins.iter}
        2    0.000    0.000    0.000    0.000 {built-in method builtins.len}
        6    0.005    0.001    0.005    0.001 {built-in method numpy.core.multiarray.array}
        2    3.298    1.649    3.298    1.649 {method 'copy' of 'numpy.ndarray' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        4    0.000    0.000    0.000    0.000 {method 'pop' of 'dict' objects}
        2    0.000    0.000    0.000    0.000 {method 'reshape' of 'numpy.ndarray' objects}


(Sorry that this example really isn't complicated enough to truly show the power of cProfile)
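One thing that does help on bigger runs (as far as I know; again, I haven't used cProfile heavily): cProfile.run takes a sort argument, and the pstats module lets you save a profile and trim the output to the worst offenders. A sketch, reusing create_mesh_grid from above:

import cProfile
import pstats

# sort the printed table by cumulative time instead of by function name
cProfile.run("create_mesh_grid(20000)", sort="cumulative")

# or save the profile to disk and inspect it afterwards
cProfile.run("create_mesh_grid(20000)", filename="mesh.prof")
stats = pstats.Stats("mesh.prof")
stats.sort_stats("cumulative").print_stats(10)   # only show the top 10 entries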

3) numba

So you've found a simple function that gets run a bunch of times, and it's making your code super slow. How do you make it faster?

numba is a great approach for keeping the simplicity + readability of pure Python, while still getting the kind of speed-up you might get from compiled C (either native C or Cython). It uses a "just-in-time" (JIT) compiler: if you use a function repeatedly, with predictable inputs, it'll compile a specialized version of that function for those inputs.


In [20]:
def simple_sum_sq(size):
    image = np.random.random((size, size))
    
    sum_sq = 0
    for i in range(size):
        for j in range(size):
            sum_sq += image[i, j]**2

In [21]:
%%timeit

simple_sum_sq(1000)


462 ms ± 9.53 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [22]:
import numba

(The only thing I've changed in the next cell is adding the decorator, @numba.jit().)


In [23]:
@numba.jit()
def simple_sum_sq_new(size):
    image = np.random.random((size, size))
    
    sum_sq = 0
    for i in range(size):
        for j in range(size):
            sum_sq += image[i, j]**2

In [24]:
%%timeit

simple_sum_sq_new(1000)


8.93 ms ± 69.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

That's pretty good. We didn't change a single line of our code; we just added a decorator, and now we have roughly a 50x speed-up. Not quite the 100x+ we got from vectorizing with numpy, but still pretty good given that it required less work.

In general, this is great for simple, mathematical, repetitive tasks with predictable input types. You probably can't just @numba.jit your entire code. But, hey, I've never tried.
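One caveat worth knowing (my own note, not from the timings above): the first call with a new input type includes the compilation itself, so timing a single call mostly measures the compiler. A quick sketch:

import time

import numba
import numpy as np

@numba.jit()
def sum_sq_jit(image):
    total = 0.0
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            total += image[i, j]**2
    return total

image = np.random.random((1000, 1000))

start = time.time()
sum_sq_jit(image)   # first call: numba compiles a version for this input type
print("first call: ", time.time() - start)

start = time.time()
sum_sq_jit(image)   # later calls reuse the compiled version
print("second call:", time.time() - start)

(%%timeit largely hides this, since it runs the cell many times and averages; if you %%time a single first call you'll see the compilation cost.)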

As an added bonus, it doesn't conflict with vectorizing your code. In some cases doing both even gives a small extra speed-up:

vectorized


In [25]:
def calculate_kinetic_energy(mass, velocity):
    E_kin = (1/2) * mass * velocity**2
    return E_kin

In [26]:
%%timeit

masses = np.random.random(10000)
velocities = np.random.random(masses.shape)
calculate_kinetic_energy(masses, velocities)


246 µs ± 5.67 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

vectorized + jit


In [27]:
@numba.jit(nopython=True)
def calculate_kinetic_energy_jit(mass, velocity):
    E_kin = (1/2) * mass * velocity**2
    return E_kin

In [28]:
%%timeit

masses = np.random.random(10000)
velocities = np.random.random(masses.shape)
calculate_kinetic_energy_jit(masses, velocities)


234 µs ± 2.35 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Real World Example

This comes from my 1st+2nd year project: https://github.com/egentry/clustered_SNe/commit/848cca6a08c3acb1b666e094cb5ae387480cd224

I've simplified things here, but the core idea is the same.

old, slow version


In [29]:
import pandas as pd

In [30]:
cols = ["Radius", "dR", "dV", "Density", 
        "Pressure", "Velocity", "Z", 
        "Temperature", "Energy", "Entropy", 
        "Mass", "M_int", "C_ad", "Crossing_time"]
cols_in   = cols[:-7]

In [31]:
from helper_functions import (calculate_mass,
                              calculate_mean_molecular_weight,
                              calculate_kinetic_energy,
                              calculate_internal_energy,
                              calculate_momentum,
                              calculate_c_ad,
                              calculate_entropy,
                              calculate_temperature,
                              calculate_w_cell,
                              calculate_crossing_time)

In [32]:
def process_checkpoints(num_checkpoints):
    df = pd.DataFrame()
    for k in range(num_checkpoints):
        
        # Create some fake data
        array_tmp = np.random.random((1000,7))
        array_tmp = array_tmp[1:-1] # drop cells with bad data
        
        
        # transform the array into a dataframe
        index     = pd.MultiIndex.from_product([[k], np.arange(array_tmp.shape[0])],
                                               names=["k","i"])
        df_tmp    = pd.DataFrame(array_tmp, columns=cols_in, index = index)

        df_tmp["Mass"]        = calculate_mass(df_tmp.Density.values,
                                               df_tmp.dV.values)
        df_tmp["M_int"]       = df_tmp.Mass.cumsum()
        df_tmp["Temperature"] = calculate_temperature(df_tmp.Pressure.values,
                                                      df_tmp.Density.values, .67)
        df_tmp["Energy"]      = calculate_internal_energy(df_tmp.Mass.values,
                                                          df_tmp.Pressure.values,
                                                          df_tmp.Density.values) \
                                                           / df_tmp.Mass.values
        df_tmp["C_ad"]        = calculate_c_ad(df_tmp.Pressure.values,
                                               df_tmp.Density.values)
        
        w_cell = calculate_w_cell(df_tmp.Velocity.values)
        df_tmp["Crossing_time"] = calculate_crossing_time(df_tmp.C_ad.values,
                                                          df_tmp.Velocity.values,
                                                          w_cell,
                                                          df_tmp.dR.values )

        df = pd.concat([df, df_tmp])
        
    return df

In [33]:
%time df_old = process_checkpoints(600)


CPU times: user 17.3 s, sys: 9.81 s, total: 27.1 s
Wall time: 27.1 s

In [34]:
df_old.head()


Out[34]:
Radius dR dV Density Pressure Velocity Z Mass M_int Temperature Energy C_ad Crossing_time
k i
0 0 0.281034 0.858699 0.072578 0.397503 0.631955 0.575377 0.178266 0.028850 0.028850 1.290430e-08 2.384715 1.627785 0.389757
1 0.075958 0.064777 0.315294 0.022397 0.854481 0.717964 0.733165 0.007062 0.035912 3.096744e-07 57.227825 7.974113 0.008051
2 0.563126 0.594142 0.653829 0.711083 0.543120 0.702907 0.425620 0.464927 0.500838 6.199618e-09 1.145689 1.128268 0.523106
3 0.853591 0.207250 0.707058 0.953489 0.935388 0.047228 0.584556 0.674172 1.175010 7.962792e-09 1.471524 1.278682 0.129006
4 0.351412 0.297804 0.084097 0.780042 0.019829 0.163382 0.333089 0.065599 1.240610 2.063389e-10 0.038131 0.205836 1.128418

When I was just processing a single simulation, ~30 seconds wasn't too bad, so I didn't worry about it. But then I wanted to process batches of hundreds of simulations, and to do that I had to speed things up.

But what's the major slowdown? Let's try line profiling!


In [35]:
%lprun -f process_checkpoints process_checkpoints(300)

Sample output:

Timer unit: 1e-06 s

Total time: 8.40759 s
File: <ipython-input-34-c7b04c4d293b>
Function: process_checkpoints at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           def process_checkpoints(num_checkpoints):
     2         1        800.0    800.0      0.0      df = pd.DataFrame()
     3       301        863.0      2.9      0.0      for k in range(num_checkpoints):
     4                                                   
     5                                                   # Create some fake data
     6       300      27055.0     90.2      0.3          array_tmp = np.random.random((1000,7))
     7       300        851.0      2.8      0.0          array_tmp = array_tmp[1:-1] # drop cells with bad data
     8                                                   
     9                                                   
    10                                                   # transform the array into a dataframe
    11       300       2051.0      6.8      0.0          index     = pd.MultiIndex.from_product([[k], np.arange(array_tmp.shape[0])],
    12       300     496092.0   1653.6      5.9                                                 names=["k","i"])
    13       300     102367.0    341.2      1.2          df_tmp    = pd.DataFrame(array_tmp, columns=cols_in, index = index)
    14                                           
    15       300      47172.0    157.2      0.6          df_tmp["Mass"]        = calculate_mass(df_tmp.Density.values,
    16       300     207859.0    692.9      2.5                                                 df_tmp.dV.values)
    17       300     297848.0    992.8      3.5          df_tmp["M_int"]       = df_tmp.Mass.cumsum()
    18       300      37834.0    126.1      0.4          df_tmp["Temperature"] = calculate_temperature(df_tmp.Pressure.values,
    19       300     196617.0    655.4      2.3                                                        df_tmp.Density.values, .67)
    20       300      37470.0    124.9      0.4          df_tmp["Energy"]      = calculate_internal_energy(df_tmp.Mass.values,
    21       300      30342.0    101.1      0.4                                                            df_tmp.Pressure.values,
    22       300      32599.0    108.7      0.4                                                            df_tmp.Density.values) \
    23       300     171803.0    572.7      2.0                                                             / df_tmp.Mass.values
    24       300      37491.0    125.0      0.4          df_tmp["C_ad"]        = calculate_c_ad(df_tmp.Pressure.values,
    25       300     197602.0    658.7      2.4                                                 df_tmp.Density.values)
    26                                                   
    27       300      53682.0    178.9      0.6          w_cell = calculate_w_cell(df_tmp.Velocity.values)
    28       300      32733.0    109.1      0.4          df_tmp["Crossing_time"] = calculate_crossing_time(df_tmp.C_ad.values,
    29       300       5134.0     17.1      0.1                                                            df_tmp.Velocity.values,
    30       300        349.0      1.2      0.0                                                            w_cell,
    31       300     195733.0    652.4      2.3                                                            df_tmp.dR.values )
    32                                           
    33       300    6195246.0  20650.8     73.7          df = pd.concat([df, df_tmp])
    34                                                   
    35         1          1.0      1.0      0.0      return df

new, faster version

So the problem with the old version was pd.concat. Since I never told pandas how large the final dataframe would be, it had to keep creating new objects and copying the data over from the old ones. (This led to $\mathcal{O}(n^2)$ scaling with respect to the number of checkpoints.)

If instead I create a dataframe with the correct size to begin with, it should go much faster (and be $\mathcal{O}(n)$).
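(As an aside, another common fix for this pattern, which isn't what I did here, is to build up a list of the per-checkpoint frames and call pd.concat a single time at the end; that also avoids the repeated copying. A rough sketch, using cols_in from above and fake data in place of the real per-checkpoint work:)

import numpy as np
import pandas as pd

pieces = []
for k in range(600):
    # stand-in for the per-checkpoint frame built in the loop above
    pieces.append(pd.DataFrame(np.random.random((998, 7)), columns=cols_in))

# one concat at the end, instead of one per iteration
df_alt = pd.concat(pieces, keys=range(600), names=["k", "i"])

Here, though, I went with preallocating the full dataframe up front.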

Before we do that, I'll just show an empty version of the dataframe, to give you a sense of what I mean.


In [36]:
dataframe_columns = ["Radius",
                     "dR",
                     "dV",
                     "Density",
                     "Pressure",
                     "Velocity",
                     "Z",
                     "Mass",
                     "M_int",
                     "Temperature",
                     "Energy",
                     "C_ad",
                     "Crossing_time",
                    ]


num_checkpoints=10
num_zones=20
index     = pd.MultiIndex.from_product([np.arange(num_checkpoints), 
                                        np.arange(num_zones)],
                                       names=["k","i"])
df = pd.DataFrame(index=index, 
                  columns=dataframe_columns,
                  dtype=float)

In [37]:
df


Out[37]:
Radius dR dV Density Pressure Velocity Z Mass M_int Temperature Energy C_ad Crossing_time
k i
0 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ...
9 18 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
19 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

200 rows × 13 columns


In [38]:
def process_checkpoints_new(num_checkpoints):
    num_zones = 1000
    
    dataframe_columns = ["Radius",
                         "dR",
                         "dV",
                         "Density",
                         "Pressure",
                         "Velocity",
                         "Z",
                         "Mass",
                         "M_int",
                         "Temperature",
                         "Energy",
                         "C_ad",
                         "Crossing_time",
                        ]

    index     = pd.MultiIndex.from_product([np.arange(num_checkpoints), 
                                            np.arange(num_zones-2)],
                                           names=["k","i"])
    df = pd.DataFrame(index=index, 
                      columns=dataframe_columns,
                      dtype=float)
    
    for k in range(num_checkpoints):
        
        # Create some fake data
        array_tmp = np.random.random((num_zones,7))
        array_tmp = array_tmp[1:-1] # drop cells with bad data
        
        # adding the data into the dataframe is much more complicated now
        df_tmp = df.loc[k]
        df_tmp[cols_in] = array_tmp

        df_tmp["Mass"]        = calculate_mass(df_tmp.Density.values,
                                               df_tmp.dV.values)
        df_tmp["M_int"]       = df_tmp.Mass.cumsum()
        df_tmp["Temperature"] = calculate_temperature(df_tmp.Pressure.values,
                                                      df_tmp.Density.values, .67)
        df_tmp["Energy"]      = calculate_internal_energy(df_tmp.Mass.values,
                                                          df_tmp.Pressure.values,
                                                          df_tmp.Density.values) \
                                                           / df_tmp.Mass.values
        df_tmp["C_ad"]        = calculate_c_ad(df_tmp.Pressure.values,
                                               df_tmp.Density.values)
        
        w_cell = calculate_w_cell(df_tmp.Velocity.values)
        df_tmp["Crossing_time"] = calculate_crossing_time(df_tmp.C_ad.values,
                                                          df_tmp.Velocity.values,
                                                          w_cell,
                                                          df_tmp.dR.values )
    return df

In [39]:
%time df_new = process_checkpoints_new(600)


CPU times: user 2.04 s, sys: 45.1 ms, total: 2.08 s
Wall time: 2.08 s

Great! A factor of ~10 speed-up, on a mostly-real-world problem.

Note though:

  1. Even once you identify what is slow, it's not always obvious exactly how to improve it.
  2. It tends to make the code more complex. If you've played with indexing in pandas, you probably know that you have to be careful to get a view (df_tmp = df.loc[k]) rather than a copy (there's a small sketch of an alternative after this list).
  3. In practice each checkpoint file has a variable number of lines. Handling that made the code even more complex, which led to a Python 2 vs. 3 bug that I didn't catch for months.
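On point 2, here's the pattern I'd reach for if the view behavior ever bit me (a toy sketch of my own, not the code above): assign through df.loc directly, so there's no intermediate object whose view-vs-copy status you have to worry about.

import numpy as np
import pandas as pd

# toy preallocated frame with the same (k, i) MultiIndex structure
index = pd.MultiIndex.from_product([np.arange(3), np.arange(4)], names=["k", "i"])
df_demo = pd.DataFrame(index=index, columns=["Density", "Mass"], dtype=float)

for k in range(3):
    density = np.random.random(4)
    # write straight into the big frame -- no df_tmp, so no view-vs-copy question
    df_demo.loc[k, "Density"] = density
    df_demo.loc[k, "Mass"] = density * 0.1   # stand-in for calculate_mass

It's wordier per line, but it sidesteps the usual SettingWithCopy headaches.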

So it's great to speed things up, but it does often come at a cost.